Data Description

1. Data source: The data is available on Kaggle and was originally sourced from an anonymous US insurance company.

2. What the data are: The dataset contains information on insurance policyholders, including variables such as age, sex, BMI, number of children, smoker status, region, and charges (the amount charged by the insurance company for health coverage). Some of these variables are categorical (e.g., sex, smoker status, region) and some are numerical (e.g., age, BMI, charges).

3. Sample size (n) and number of variables (k): The full Kaggle dataset has 1,338 rows (policyholders) and 7 columns (variables); the copy analyzed here contains 1,268 rows.

4. Why the data are of interest: The data can be used to explore various questions related to insurance policy holders, such as whether certain demographic groups are more likely to smoke, whether there is a relationship between BMI and charges, and whether charges vary by region. This information can be useful for insurance companies in setting prices and designing policies, as well as for researchers studying health outcomes and disparities.

5. Disclaimer: The response variable in this dataset is highly skewed, and with our current methods there may not be enough information to find a fully satisfactory transformation for the model.

1. Question to be answered:

Can we predict the charges of an insurance policyholder based on their age, gender, BMI, number of children, smoker status, and region of residence?

2. Collect and organize the data:

library(readxl)
library(car)
## Loading required package: carData
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
insurance <- read_excel("insurance.xlsx")
insurance$smoker <- relevel(factor(insurance$smoker), ref="yes")

This dataset is already complete and clean, so little organizing needs to be done. Every predictor is plausibly useful for predicting total charges.

full_model <- lm(charges ~ ., data = insurance)

3. Organize the data for analysis:

This includes checking for mistakes and coding qualitative variables using indicator variables.

str(insurance)
## tibble [1,268 × 7] (S3: tbl_df/tbl/data.frame)
##  $ age     : num [1:1268] 19 18 33 32 31 46 37 60 25 62 ...
##  $ sex     : chr [1:1268] "female" "male" "male" "male" ...
##  $ bmi     : num [1:1268] 27.9 33.8 22.7 28.9 25.7 ...
##  $ children: num [1:1268] 0 1 0 0 0 1 3 0 0 0 ...
##  $ smoker  : Factor w/ 2 levels "yes","no": 1 2 2 2 2 2 2 2 2 1 ...
##  $ region  : chr [1:1268] "southwest" "southeast" "northwest" "northwest" ...
##  $ charges : num [1:1268] 16885 1726 21984 3867 3757 ...

There is no missing data in this dataset, and we do not have to worry about the categorical variables because R will convert them to indicator variables when we run lm later.
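
To see what that conversion looks like, `model.matrix()` shows the indicator (dummy) coding that `lm()` builds internally. The toy data frame below is made up so the snippet is self-contained; the same call works on `insurance` directly.

```r
# Toy example of R's indicator coding; "toy" is a made-up stand-in
toy <- data.frame(
  smoker  = factor(c("yes", "no", "no"), levels = c("yes", "no")),
  charges = c(16885, 1726, 21984)
)
# lm() builds this design matrix: an intercept plus a "smokerno" dummy.
# The reference level ("yes") is absorbed into the intercept.
model.matrix(charges ~ smoker, data = toy)
```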

4. Graph the data and calculate some summary statistics:

This step gives a feel for the data set; it can also alert you to potential data-entry mistakes and might suggest some variable transformations.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
insurance_num <- select_if(insurance, is.numeric)
pairs(insurance_num)

Above is the scatterplot matrix for all numeric variables.
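
Numeric summaries complement the scatterplot matrix. This sketch runs on a small made-up sample so it is self-contained, but `summary(insurance)` and the same `tapply()` call apply to the real data directly.

```r
# Made-up mini-sample with the same column types as the insurance data
toy <- data.frame(
  charges = c(16885, 1726, 21984, 3867),
  smoker  = factor(c("yes", "no", "no", "no"))
)
summary(toy$charges)                    # five-number summary plus mean
tapply(toy$charges, toy$smoker, mean)   # mean charges by smoker status
```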

5. Fit an initial model:

Use each of the potential quantitative predictors and indicator variables for the qualitative predictors. Use untransformed Y as the response variable unless a transformed Y is suggested beforehand [e.g., highly skewed Y is often better analyzed as \(\log_e(y)\)]. If there is background knowledge which suggests that particular transformations or interactions might be important, then include them in this initial model.

summary(full_model)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11527.3  -2887.5   -995.5   1422.3  29711.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      11909.72    1072.11  11.109  < 2e-16 ***
## age                259.34      12.35  20.999  < 2e-16 ***
## sexmale            -30.15     345.34  -0.087  0.93045    
## bmi                337.45      29.62  11.394  < 2e-16 ***
## children           466.98     142.91   3.268  0.00111 ** 
## smokerno        -23954.39     426.63 -56.148  < 2e-16 ***
## regionnorthwest   -318.01     496.36  -0.641  0.52185    
## regionsoutheast   -913.78     495.21  -1.845  0.06524 .  
## regionsouthwest   -899.42     492.25  -1.827  0.06791 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6121 on 1259 degrees of freedom
## Multiple R-squared:   0.75,  Adjusted R-squared:  0.7484 
## F-statistic: 472.1 on 8 and 1259 DF,  p-value: < 2.2e-16

6. Check the four regression assumptions for this initial model using residual plots.

If there is a strong suggestion that one or more of the assumptions is violated, then proceed to step 7; otherwise, if everything checks out, proceed to step 8.

par(mfrow=c(2,3))
plot(full_model)
plot(full_model$residuals, main = "Residuals vs Order Plot")

  1. Linearity: The relationship between the independent variables and the dependent variable should be linear. However, the Residuals vs Fitted plot shows a cone-shaped pattern, so linearity may be violated.

  2. Normality: The residuals should be normally distributed. However, the QQ plot curves upward, indicating a non-normal distribution, so normality may be violated.

  3. Constant Variance: The variance of the residuals should be constant across all levels of the independent variables. However, in the Residuals vs Fitted plot the spread of the residuals increases as the fitted values increase, so constant variance may be violated.

  4. Independence: The residuals should be independent of each other. There does not seem to be a pattern in the Residuals vs Order plot, so independence appears to hold.
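
The visual checks above can be complemented with formal tests. This is a sketch on a toy fit (made up here so the snippet is self-contained); the same calls apply to `full_model`. Note that `bptest()` comes from the lmtest package, which this document does not load, so it is left commented.

```r
# Toy fit standing in for full_model
set.seed(1)
toy_fit <- lm(y ~ x, data = data.frame(x = 1:50, y = 1:50 + rnorm(50)))

# Shapiro-Wilk test of residual normality (base R); small p-values
# suggest the normality assumption is violated
shapiro.test(residuals(toy_fit))

# Breusch-Pagan test of constant variance (requires the lmtest package):
# lmtest::bptest(toy_fit)
```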

Fixing Assumptions

Partial F-test for Suspicious p-values

According to the p-values found in the full model, we may suspect that \(sex\) and \(region\) do not have a strong influence on \(charges\). We can run a partial F-test to check.

reduced <- lm(charges ~ age + bmi + children + factor(smoker), data = insurance)
summary(reduced)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + factor(smoker), 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12036.1  -2936.3   -972.8   1381.6  29342.7 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       11739.77    1028.49   11.41  < 2e-16 ***
## age                 260.05      12.34   21.07  < 2e-16 ***
## bmi                 323.06      28.37   11.39  < 2e-16 ***
## children            462.83     142.83    3.24  0.00122 ** 
## factor(smoker)no -23921.33     424.34  -56.37  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6123 on 1263 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7482 
## F-statistic: 942.3 on 4 and 1263 DF,  p-value: < 2.2e-16
anova(reduced, full_model)

Hypothesis:

\(H_0:\) \(\beta_2\) = \(\beta_6\) = \(\beta_7\) = \(\beta_8\) = 0

\(H_a:\) At least one of the \(\beta's\) is not 0

Decision Rule:

Reject \(H_0\) if p-value < 0.05

Fail to reject \(H_0\) if p-value is \(\ge\) 0.05

Test Statistic:

p-value = 0.1654

Decision/Conclusion:

Since the p-value of 0.1654 is greater than 0.05, we fail to reject the null hypothesis. In other words, we are able to use the reduced model without \(sex\) and \(region\).
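
For reference, the statistic that `anova(reduced, full_model)` computes is the partial F

\[
F = \frac{(SSE_R - SSE_F)/q}{SSE_F/(n - k - 1)},
\]

where \(SSE_R\) and \(SSE_F\) are the error sums of squares of the reduced and full models, \(q = 4\) is the number of dropped coefficients (one for \(sex\), three for \(region\)), and \(n - k - 1 = 1259\) is the full model's residual degrees of freedom.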

hist(insurance$charges)

Since \(charges\) is highly skewed, we have reason to consider a log transformation. We will conduct a Box-Cox analysis to determine whether a log transformation of the response is truly necessary.

Box Cox

Since the response is highly skewed, we can use a Box-Cox analysis to identify what transformation should be applied.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:olsrr':
## 
##     cement
bc <- boxcox(reduced)

lambda <- bc$x[which.max(bc$y)]
lambda
## [1] 0.1414141

With \(\lambda \approx 0.14\), which is close to 0, a log transformation of the response variable is appropriate. We will use the logged response from here on out.
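
For reference, the Box-Cox family being searched over is

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log_e(y), & \lambda = 0,
\end{cases}
\]

so an estimated \(\lambda\) near 0 points to the plain log transform, which is also far easier to interpret than a power of 0.14.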

log_reponse <- lm(I(log(charges)) ~ age + bmi + factor(smoker) + children, data = insurance)
summary(log_reponse)
## 
## Call:
## lm(formula = I(log(charges)) ~ age + bmi + factor(smoker) + children, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11185 -0.19741 -0.04754  0.06978  2.07769 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.5311400  0.0755440 112.929  < 2e-16 ***
## age               0.0349574  0.0009067  38.556  < 2e-16 ***
## bmi               0.0105818  0.0020839   5.078 4.39e-07 ***
## factor(smoker)no -1.5515840  0.0311681 -49.781  < 2e-16 ***
## children          0.1000297  0.0104910   9.535  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4497 on 1263 degrees of freedom
## Multiple R-squared:  0.7628, Adjusted R-squared:  0.7621 
## F-statistic:  1016 on 4 and 1263 DF,  p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(log_reponse)
hist(log_reponse$residuals)
plot(log_reponse$residuals)

By logging just the response, the residuals vs fitted plot is still cone-shaped and shows an obvious curvature, so linearity is still a problem. However, equal spread has improved. Normality is still a problem, and independence is still not violated.

Scatterplot of predictors

pairs(insurance_num)

Since we know that we must log the response, we can test which predictor variables should also be logged. According to the scatterplot matrix, we may need to log \(children\), and perhaps \(bmi\) as well, since those are the ones whose points are clustered together in a narrow region.

VIF

vif(reduced)
##            age            bmi       children factor(smoker) 
##       1.014327       1.011952       1.001579       1.001196

We can also check the variance inflation factors; as shown, all values are close to 1, so multicollinearity will not be a problem in this case.
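
For reference, the variance inflation factor for predictor \(j\) is

\[
VIF_j = \frac{1}{1 - R_j^2},
\]

where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. Values near 1 (as here) indicate essentially no collinearity, while values above 5 or 10 are common warning thresholds.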

Log Transformations

Log Children

Note: we have to log \(children + 1\) because the log of 0 is undefined, and some of the values of \(children\) are 0.
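
As a side note, base R's `log1p()` computes \(\log(1+x)\) directly (and more accurately for tiny \(x\)), so it is a drop-in alternative for handling the zero counts; a self-contained sketch:

```r
# log1p(x) is log(1 + x); it handles the zero counts in children directly
log1p(0)                          # exactly 0, same as log(0 + 1)
all.equal(log1p(3), log(3 + 1))   # TRUE: identical up to floating point
```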

logged_children <- lm(I(log(charges)) ~ age + bmi + factor(smoker) + 
                          I(log(children + 1)), data = insurance)
summary(logged_children)
## 
## Call:
## lm(formula = I(log(charges)) ~ age + bmi + factor(smoker) + I(log(children + 
##     1)), data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09145 -0.20116 -0.05053  0.07288  2.08333 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.5161293  0.0755772 112.681  < 2e-16 ***
## age                   0.0348967  0.0009052  38.553  < 2e-16 ***
## bmi                   0.0104753  0.0020801   5.036 5.44e-07 ***
## factor(smoker)no     -1.5502161  0.0311097 -49.831  < 2e-16 ***
## I(log(children + 1))  0.2219038  0.0226404   9.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4489 on 1263 degrees of freedom
## Multiple R-squared:  0.7637, Adjusted R-squared:  0.763 
## F-statistic:  1021 on 4 and 1263 DF,  p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(logged_children)
plot(logged_children$residuals)

Logging \(children\) improved the adjusted R-squared slightly; linearity and equal spread improved as well, though they are still not ideal. We can try logging \(bmi\) to see if it improves the model further.

Log BMI

logged_bmi <- lm(I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) + 
                          children, data = insurance)
summary(logged_bmi)
## 
## Call:
## lm(formula = I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) + 
##     children, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06754 -0.19480 -0.05033  0.07082  2.07333 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.7097262  0.2148210  35.889  < 2e-16 ***
## age               0.0348984  0.0009062  38.509  < 2e-16 ***
## I(log(bmi))       0.3375252  0.0628408   5.371 9.31e-08 ***
## factor(smoker)no -1.5520986  0.0311298 -49.859  < 2e-16 ***
## children          0.1000456  0.0104786   9.548  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4492 on 1263 degrees of freedom
## Multiple R-squared:  0.7634, Adjusted R-squared:  0.7626 
## F-statistic:  1019 on 4 and 1263 DF,  p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(logged_bmi)
plot(logged_bmi$residuals)

After logging only BMI, the assumption plots are still very similar, and the adjusted R-squared is very close to what we had before.

Log Children and BMI

logged_children_bmi <- lm(I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) + I(log(children + 1)), data = insurance)
summary(logged_children_bmi)
## 
## Call:
## lm(formula = I(log(charges)) ~ age + I(log(bmi)) + factor(smoker) + 
##     I(log(children + 1)), data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04787 -0.20324 -0.05391  0.07524  2.09095 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.7013872  0.2144374  35.914  < 2e-16 ***
## age                   0.0348375  0.0009047  38.506  < 2e-16 ***
## I(log(bmi))           0.3346014  0.0627231   5.335 1.13e-07 ***
## factor(smoker)no     -1.5507229  0.0310711 -49.909  < 2e-16 ***
## I(log(children + 1))  0.2219591  0.0226129   9.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4483 on 1263 degrees of freedom
## Multiple R-squared:  0.7643, Adjusted R-squared:  0.7636 
## F-statistic:  1024 on 4 and 1263 DF,  p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(logged_children_bmi)
plot(logged_children_bmi$residuals)

Logging \(children\) and \(bmi\) together improved the adjusted R-squared slightly; the diagnostic plots look very similar to what we had before. We will continue with this model, as it has the highest adjusted R-squared of the three log transformations.

Interactions

Now that we have the log-transformed model, we would like to test whether certain interaction terms are significant for predicting charges. Note that the interaction models below are fit on the untransformed response, so their R-squared values are not directly comparable to those of the logged models.

interaction_model <- lm(charges ~ age + bmi + children + factor(smoker) + bmi:smoker, data = insurance)
summary(interaction_model)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + factor(smoker) + 
##     bmi:smoker, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14317  -1957  -1338   -469  29796 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -22573.112   1553.579 -14.530  < 2e-16 ***
## age                 268.199      9.955  26.942  < 2e-16 ***
## bmi                1426.240     48.036  29.691  < 2e-16 ***
## children            499.413    115.135   4.338 1.55e-05 ***
## factor(smoker)no  19876.453   1711.631  11.613  < 2e-16 ***
## bmi:smokerno      -1424.556     54.549 -26.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4935 on 1262 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8364 
## F-statistic:  1297 on 5 and 1262 DF,  p-value: < 2.2e-16
par(mfrow=c(2,3))
plot(interaction_model)
plot(interaction_model$residuals)

With the addition of the interaction term, we get an R-squared of 0.8371, the highest we have obtained, along with a fairly high adjusted R-squared. Although we are unable to compare the two models directly because their responses differ, the diagnostic plots for the interaction model are much better than those for the logged model. We will use it as our final model.

Checks

Cook’s Distance

cd <- cooks.distance(interaction_model)
plot(cd)
abline(a=0.5, b=0, col="red")

According to Cook's distance, none of the observations are greater than 0.5, which means we do not need to remove any observations from the dataset; none are unusually influential.
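
A stricter rule of thumb flags points with \(D_i > 4/n\). This sketch uses a toy fit (made up here so the snippet is self-contained); the same two lines apply to `interaction_model` with the real \(n\).

```r
# Toy fit standing in for interaction_model
set.seed(2)
toy_fit <- lm(y ~ x, data = data.frame(x = 1:30, y = 1:30 + rnorm(30)))
cd_toy  <- cooks.distance(toy_fit)
which(cd_toy > 4 / length(cd_toy))  # indices of observations flagged as influential
```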

Standard Residuals

st_red <- rstandard(interaction_model)
plot(st_red)
st_red[st_red > 2]
##        3        8       32       59       98      110      123      132 
## 3.205346 3.144250 2.815317 3.078631 3.889590 3.413665 4.174042 2.838946 
##      133      135      210      218      232      234      277      278 
## 4.042542 2.439497 4.324639 2.296324 4.577266 2.564456 2.680335 2.970067 
##      291      305      323      337      360      367      376      407 
## 2.904730 3.729795 3.077686 3.025589 2.540102 3.759814 2.759557 4.187226 
##      419      443      448      451      464      488      491      495 
## 2.963111 3.638697 2.213716 2.161305 2.195029 4.972269 3.022320 2.412964 
##      496      503      508      522      540      543      548      551 
## 4.182876 2.419813 3.200366 2.804840 3.476956 3.663617 2.272447 2.804164 
##      562      599      619      648      655      712      728      763 
## 4.294331 3.482870 3.175303 3.203065 3.370333 2.304717 2.583914 4.035924 
##      775      813      830      871      878      888      910      914 
## 3.696845 2.769808 3.072564 2.631893 2.751626 4.452207 3.596641 2.872639 
##      929      932      935      951      955      958      964      971 
## 2.676080 2.372797 3.735370 2.232223 3.691846 4.242893 4.470659 3.677156 
##      982     1041     1046     1064     1074     1081     1084     1095 
## 3.866506 2.382366 2.665465 2.799669 3.003317 3.212294 2.499007 2.013193 
##     1100     1132     1142     1146     1150     1164     1191     1232 
## 2.648571 3.025402 4.612277 2.990084 2.173118 3.654485 3.340951 6.051660 
##     1250     1259 
## 2.185366 3.630167
abline(a=2, b=0, col="red")
abline(a=-2, b=0, col="red")

From the standardized residuals, we can see that several points exceed 2 in absolute value. This could be a problem, as these points are unusual.

Leverage

lev <- hatvalues(interaction_model)
plot(lev)
dim(insurance)
## [1] 1268    7
abline(a=18/1268, b=0, col="red")

There are also several points with high leverage.
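
A common rule flags observations with

\[
h_{ii} > \frac{3(k+1)}{n};
\]

with \(k + 1 = 6\) estimated coefficients in the interaction model and \(n = 1268\), this cutoff is \(18/1268 \approx 0.0142\).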

Final Model

Linearity and equal spread have both been improved, but normality is still a problem. Every predictor is significant in the summary.

summary(interaction_model)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + factor(smoker) + 
##     bmi:smoker, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14317  -1957  -1338   -469  29796 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -22573.112   1553.579 -14.530  < 2e-16 ***
## age                 268.199      9.955  26.942  < 2e-16 ***
## bmi                1426.240     48.036  29.691  < 2e-16 ***
## children            499.413    115.135   4.338 1.55e-05 ***
## factor(smoker)no  19876.453   1711.631  11.613  < 2e-16 ***
## bmi:smokerno      -1424.556     54.549 -26.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4935 on 1262 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8364 
## F-statistic:  1297 on 5 and 1262 DF,  p-value: < 2.2e-16

Closing Remarks

According to this model, linearity and equal spread have improved from the original model. Independence is also not a problem, as there is no visible pattern in the residuals vs index plot. However, as before, normality is still a problem.

Normality may not be completely fixable due to the nature of the sample data. A disproportionate number of observations had charges under 20,000 dollars, which makes the histogram heavily right-skewed. This disparity makes sense given the nature of the healthcare industry: very few individuals incur charges greater than 20,000 dollars, or even 10,000.

However, since we are using the model for prediction, the assumption violations play a smaller role in that setting. With our model, the predictions are fairly close to the observed values, with some points predicted better than others. Thus, in the end, our model was able to predict the charges to a fair degree.
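
As an illustration of using a fitted model for prediction, here is a self-contained sketch on made-up data; the same `predict()` call applies to `interaction_model` with a new policyholder's age, bmi, children, and smoker values.

```r
# Toy fit standing in for the final model; the numbers are made up
toy <- data.frame(age     = c(20, 30, 40, 50, 60),
                  charges = c(2000, 4000, 6500, 9000, 11500))
fit <- lm(charges ~ age, data = toy)
# Point prediction plus a 95% prediction interval for a new observation
predict(fit, newdata = data.frame(age = 45), interval = "prediction")
```

Here the point prediction works out to 7800, bracketed by the lower and upper prediction bounds.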

Interaction Variables

First, we include all of the plausible two-way interaction terms, and then use backwards elimination to select the best model.

interaction_model <- lm(charges ~ age + bmi + factor(smoker) + children + 
                          bmi:smoker + age:bmi + age:smoker + age:children + 
                          bmi:children + children:smoker, data = insurance)
ols_step_backward_p(interaction_model)
## 
## 
##                                Elimination Summary                                
## ---------------------------------------------------------------------------------
##         Variable                      Adj.                                           
## Step      Removed       R-Square    R-Square     C(p)        AIC          RMSE       
## ---------------------------------------------------------------------------------
##    1    age:smoker        0.8376      0.8364    9.0086    25176.9612    4935.3016    
##    2    bmi:children      0.8376      0.8365    7.0447    25174.9977    4933.4122    
##    3    age:children      0.8376      0.8367    5.1009    25173.0543    4931.5642    
## ---------------------------------------------------------------------------------

From backwards elimination, \(age:smoker\), \(bmi:children\), and \(age:children\) are removed. We also drop \(age:bmi\) and \(children:smoker\), keeping only \(bmi:smoker\), and then conduct a partial F-test to see whether this reduced model is adequate.

best_model <- lm(charges ~ age + bmi + factor(smoker) + children + 
                 bmi:smoker, data = insurance)
summary(best_model)
## 
## Call:
## lm(formula = charges ~ age + bmi + factor(smoker) + children + 
##     bmi:smoker, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14317  -1957  -1338   -469  29796 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -22573.112   1553.579 -14.530  < 2e-16 ***
## age                 268.199      9.955  26.942  < 2e-16 ***
## bmi                1426.240     48.036  29.691  < 2e-16 ***
## factor(smoker)no  19876.453   1711.631  11.613  < 2e-16 ***
## children            499.413    115.135   4.338 1.55e-05 ***
## bmi:smokerno      -1424.556     54.549 -26.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4935 on 1262 degrees of freedom
## Multiple R-squared:  0.8371, Adjusted R-squared:  0.8364 
## F-statistic:  1297 on 5 and 1262 DF,  p-value: < 2.2e-16
anova(best_model, interaction_model)

Hypothesis:

\(H_0:\) \(\beta_6\) = \(\beta_7\) = \(\beta_8\) = \(\beta_9\) = \(\beta_{10}\) = 0

\(H_a:\) At least one of the \(\beta's\) is not 0

Decision Rule:

Reject \(H_0\) if p-value < 0.05

Fail to reject \(H_0\) if p-value is \(\ge\) 0.05

Test Statistic:

p-value = 0.9926

Decision/Conclusion:

Since the partial F-test returned a p-value of 0.9926, we fail to reject the null hypothesis, which means we are able to use the reduced model with only the \(bmi:smoker\) interaction.